The dynamics and geometry of choice in premotor cortex

The brain represents sensory variables in the coordinated activity of neural populations, in which tuning curves of single neurons define the geometry of the population code. Whether the same coding principle holds for dynamic cognitive variables remains unknown because internal cognitive processes unfold with a unique time course on single trials observed only in the irregular spiking of heterogeneous neural populations. Here we show the existence of such a population code for the dynamics of choice formation in the primate premotor cortex. We developed an approach to simultaneously infer population dynamics and tuning functions of single neurons to the population state. Applied to spike data recorded during decision-making, our model revealed that populations of neurons encoded the same dynamic variable predicting choices, and heterogeneous firing rates resulted from the diverse tuning of single neurons to this decision variable. The inferred dynamics indicated an attractor mechanism for decision computation. Our results reveal a common geometric principle for neural encoding of sensory and dynamic cognitive variables.

Supplementary Figure 1. Model fitting and model selection on synthetic data. We generated 400 trials of spike data for a single neuron from a ground-truth model with a single-barrier potential, narrow zerocentered p 0 (x) distribution and linear firing rate function. We fit the model independently on two halves of the data D 1 and D 2 (200 trials each) and compare features between the models inferred from each data halve to select the optimal model. a, Relative log-likelihood ([log L gt − log L ] / log L gt , where L gt is the likelihood of the ground truth model) decreases with the epoch number indicating an improving fit on the training data.b, Jensen-Shannon (JS, y-axis) divergence between the models with the same feature complexity (FC, x-axis) inferred from two data halves. Colored dots indicate three levels of FC that illustrate underfitting (mint), optimal (blue), and overfitting (purple) regimes. The optimal model is selected at FC where JS divergence exceeds a fixed threshold JS thres = 0.0015. The JS threshold was the same for all results in this work. c, Noise magnitude D for each training epoch for models fitted on D 1 and D 2 . Black line indicates the ground-truth value of D. d, The ground-truth potential (black) and the potentials of the models fitted on D 1 and D 2 (colored lines) for low feature complexity (mint dot in b). The inferred potentials overlap but miss some of the ground-truth features, which is the underfitting regime. e, Same as d for the initial state distribution p 0 (x). f, Same as d for the tuning function f (x). g-i Same as d-f for the optimal feature complexity (blue dot in b). The models inferred from two data halves overlap and match the ground-truth model. j-l Same as d-f for high feature complexity (purple dot in b). The models inferred from two data halves diverge and disagree with the ground-truth, which is the overfitting regime. , tuning functions (middle right) and noise magnitudes (right) for a left-hard condition obtained from two data halves D 1 and D 2 . This model was selected using our model selection method based on feature consistency. b, Same as a for the right-hard condition. The slope of the inferred potential points towards the opposite boundary than in the left-hard condition, while the inferred tuning function is the same. c, Same as a for the left-easy condition. The inferred tuning function is the same as in the hard conditions in the left part of the domain where the dynamics evolve towards the correct choice, but the tuning function is inferred less accurately in the right side of the domain (grey highlight) due to very small number of rightward choices (error trials) in the left-easy stimulus condition. d, Same as c for the right-easy condition. e, Shared optimization across all four stimulus conditions, in which p 0 (x), tuning functions, and noise magnitude are restricted to be the same and only the potential Φ(x) can vary across stimulus conditions. The shared optimization enables more accurate inference of tuning functions, because it learns a single tuning function across four conditions so that the number of leftward and rightward choices are approximately balanced in the data, and the dynamics equally explore both sides of the decision manifold.  Figure 3. Inference of dynamics with different complexity from synthetic singleneuron data. We generate spike data for a single neuron using our model with varying complexity of the potential and tuning function, with the amount of data (800 trials total) similar to our PMd recordings. We perform the model fitting and model selection to test how the inference accuracy depends on the complexity of the ground-truth dynamics. a, The inferred potential (left), p 0 (x) distribution (middle left), tuning function (middle right) and noise magnitude (right) for a model with a single-barrier potential and a linear tuning function. The inferred model (blue) tightly overlaps with the ground-truth (black). b, Same as a for a model that has a more complex potential with two barriers and a non-linear tuning function. The inferred model matches the ground truth, but the estimation uncertainty is higher and the inference is less accurate in the regions that are poorly explored by the dynamics in the data (the region with high potential near the left boundary). The inference accuracy improves when fitting data from a population of neurons ( Supplementary  Fig. 4). c, Shared optimization across two stimulus conditions for the same model as in b. In two conditions, the potentials are mirror images of each other, and p 0 (x), D and tuning function are the same. In the shared optimization, we restrict p 0 (x), D and the tuning function to be the same across conditions. The shared optimization leads to a lower estimation uncertainty and more accurate inference (cf. to b). The inference accuracy further improves when fitting data from a population of neurons using shared optimization ( Supplementary Fig. 5). In all panels, error bars are s.t.d. across 10 bootstrap samples.
x Latent state,  Figure 4. Inference of complex dynamics from synthetic population data. We generate spike data for a population of 9 neurons using our model with a complex two-barrier potential (same is in Supplementary Fig. 3b) and heterogeneous non-linear tuning functions, with the amount of data (800 trials total) similar to our PMd recordings. We perform the model fitting and model selection to test the inference accuracy. a, The inferred potential (blue) tightly overlaps with the ground truth (black). The inference is inaccurate near the left boundary, since this region with the high potential is poorly explored by the dynamics in the data. The inference accuracy further improves for the shared optimization across multiple stimulus conditions ( Supplementary Fig. 5). b, The inferred p 0 (x) distribution (blue) tightly overlaps with the ground truth (black). c, The inferred tuning functions (blue) tightly overlap with the ground truth (black). The inference of tuning functions is more accurate compared to the single-neuron models (cf. Supplementary  Fig. 3), because spikes of multiple neurons in the population contribute to a more precise estimation of the latent state, which in turn enables a more accurate inference of the tuning function of each neuron. In all panels, error bars are s.t.d. across 10 bootstrap samples.
x Latent state, We generate spike data for a population of 9 neurons using the same model as in Supplementary Fig. 4 for two stimulus conditions. In two conditions, the potentials are mirror images of each other, and p 0 (x), D and tuning function are the same. In the shared optimization, we restrict p 0 (x), D and the tuning functions to be the same across conditions. We perform the model fitting and model selection to test the inference accuracy. a, The inferred potentials (blue) tightly overlap with the ground truth (black) in both conditions. The inference is accurate even in the regions with high potential in each condition, because the tuning functions inferred from one condition enable more accurate inference of dynamics from limited data in the other condition. b, The inferred p 0 (x) distribution (blue) tightly overlaps with the ground truth (black). c, The inferred tuning functions (blue) tightly overlap with the ground truth (black). The inference is accurate on both left and right sides of the latent space, since the regions not explored by the dynamics in one condition are explored in another condition, which enables the accurate inference of tuning functions in the entire domain.
x Latent state,  , and its tuning function shows unrealistically high firing rates in the population model as well.

Analytical calculation and numerical evaluation of the likelihood
The likelihood is calculated by marginalizing the joint probability of simultaneously observing a latent trajectory X (t) and spike data Y (t) from a model θ: We first integrate P (X (t), Y (t)|θ) over the distribution of all latent paths in between the observed spikes to obtain the joint probability P ( x t E } is a discretized trajectory which consists of the initial state x t 0 , the final state at the trial end x t E , and all states x t 1 , ...x t N at the times of spike observations from all neurons. The likelihood is obtained by marginalizing P (X(t), Y (t)|θ) over the discretized trajectory: Using the Markov property of the latent Langevin dynamics Eq. (1) and conditional independence of spike observations, the joint probability density P (X(t), Y (t)) can be factorized [1]: Here p(y t i |x t i )dt is the probability of observing a spike from neuron k i within a small dt of time t i given the latent state x t i , hence p(y t i |x t i ) = f k i (x t i ) by the definition of the instantaneous Poisson firing rate, where k i is the index of the neuron that emitted a spike at time t i . p(x t 0 ) is the probability density of the initial latent state. p(x t i |x t i−1 ) is the transition probability density from x t i−1 to x t i during the time interval between the adjacent spike observations from all neurons, which accounts for the absence of spikes during this time interval. This transition probability is marginalized over all intermediate latent paths connecting x t i−1 at time t i−1 and x t i at time t i . Finally, the term p(A|x t E ) is the absorption operator, which ensures that only trajectories terminating at one of the domain boundaries at time t E contribute to the likelihood [1]. p(x t i |x t i−1 ) is a solution of the modified Fokker-Planck equation [1]: where the term M k=1 f k (x) is the total firing rate of all neurons that accounts for the absence of spikes during the interspike intervals. The absorbtion operator is A =Ĥ 0 , whereĤ 0 is the Fokker-Planck operator [1]:Ĥ To numerically calculate the likelihood using Eq. (2), we need to solve Eq. (4) for each interspike interval. To solve Eq. (4), we introduce a Hermitian operator H = exp(Φ(x)/2)Ĥ exp(−Φ(x)/2) that propagates forward in time the scaled probability density ρ(x, t) = p(x, t) exp(Φ(x)/2): Eq. (6) is a linear equation with a symmetric differential operator that allows for an efficient numerical solution. The operator H can be decomposed into a sum of two operator: H = H 0 +H I , where H 0 accounts for drift and diffusion in the latent space, and H I accounts for the probability decay during interspike intervals: First, we find the eigenvalues and eigenvectors of the operator H 0 : To this end, we introduce scaled eigenfunctions φ 0 (x) = exp(Φ(x)/2)Ψ 0 (x), which are the solution of the following scaled eigenvalue problem: We solve the problem Eq. (9) numerically using the spectral elements method (SEM, see Supplementary Information in Refs. [1,2] for details). We obtain the set of eigenvalues λ 0,i and the eigenvectors φ 0,i (x k ), where i = 1, 2, ..., N v indexes the eigenvalues and eigenvectors, N v is the number of retained eigenvalues, and k indexes grid points in the discretized domain. For all model fits, we set N v = N − 2 = 447, where N = 449 is the size of the SEM grid. The problem Eq. (9) is a generalized eigenvalue problem with a non-trivial right-hand side function exp(−Φ(x)).
With the SEM discretization, the mass matrix M for Eq. (9) is a product of exp(−Φ(x)) with the vector of SEM weights w. The eigenvectors are orthogonal with respect to the mass matrix: where W 0 is a diagonal matrix with diagonal entries equal to the elementwise product of exp(−Φ(x)) with the vector of SEM weights w.
After finding the eigenvalues λ 0,i and eigenvectors φ 0,i (x) of the problem Eq. (9), we obtain the solution of the eigenvalue problem Eq. (8) by scaling back the eigenvectors: Ψ 0,i (x) = φ 0,i (x) exp(−Φ(x)/2), and the eigenvalues stay unchanged. It is convenient to represent the eigenvalues as a single vector λ 0 = {λ 0,i }, and the eigenvectors as a transformation matrix that contains each eigenvector as a column Q 0 = {Ψ 0 }. The scaling Ψ 0,i (x) = φ 0,i (x) exp(−Φ(x)/2) makes the new eigenvectors Ψ(x) to be orthogonal with respect to the diagonal matrix of the SEM weights W = diag(w), so that Q T 0 W Q 0 = I. Next, we find the eigenvalues and eigenvectors of the operator H: To this end, we rewrite Eq. (10) in the basis of operator H 0 . By using the definition H = H 0 +H I , Eq. (10) in the basis of operator H 0 reads: where F is the sum of tuning functions of all neurons M k=1 f k (x) discretized on the SEM grid into a diagonal matrix. After obtaining the matrix of eigenvectorsQ, which is a transformation matrix from the basis of operator H 0 to the basis of H, we find the solution of the original problem as Q = Q 0Q . The matrix Q contains the eigenvectors Ψ that solve the problem Eq. (10). The eigenvalues of Eq. (10) are the same as for Eq. (11), since the eigenvalues are basis independent.
In summary, to find the time-dependent solution of Eq. (6), we solve the corresponding eigenvalue problem Eq. (10). This problem is solved by finding the first N v eigenvalues and eigenvectors of Eq. (9), then multiplying the eigenfunctions φ 0 by exp(−Φ(x)/2) to obtain the eigenfunctions Ψ 0 of the problem Eq. (8) that constitute the matrix Q 0 . Using these eigenvectors and the eigenvalues λ 0 , we then solve another eigenvalue problem Eq. (11), from which we obtain the eigenvalues λ and the eigenvectorsQ. The eigenvalues λ and the eigenvectors Q = Q 0Q are then the solution of Eq. (10).
For numerical solution, we discretize all functions and the eigenvalue problems in the SEM grid [1]. Thus, all probability densities in Eq. (3) become vectors of size N and all operators become matrices of size N 2 , where N is the number of grid points (we set N = 449 for all model fits). The scaled transition probability density ρ(x t i |x t i−1 ) is solution of Eq. (6) and can be expressed in the basis of operator H as: where ρ i,i−1 is the transition matrix over latent states between the times t i−1 and t i of the adjacent spikes. When the scaled probability density of latent states ρ j at time t j is multiplied by the transition matrix ρ j+1,j between the times t j and t j+1 , the result is the scaled probability density ρ j+1 at time t j+1 , since the matrix-vector product marginalizes over the distribution of latent states at time t j . Similarly, the emission matrix, which gives the probability of observing a spike from neuron k conditioned on the latent state, is expressed in the basis of operator H as: where f k is a vector of the discretized tuning function f k (x). The matrix of the absorption operator A =Ĥ 0 in the H-basis is: whereQ are the eigenvectors of the problem Eq. (11). We can rewrite Eqs. (2), (3) in the finite basis of operator H to obtain the chain of vectormatrix multiplications: Here β N +2 is a column vector used to integrate the likelihood over the final state x t E . This vector is equal to β N +2 = ρ eq w, where ρ eq is a discretized vector of ρ eq (x) = √ p eq = exp(−Φ(x)/2) that scales ρ(x, t) back into p(x, t) and sums it up with the vector of SEM weights to integrate over x t E . In Eq. (15), the indices k 1 , k 2 , . . . k N refer to the index of a neuron emitting a spike at times t 1 , t 2 , . . . t N , and the spike times are ordered across all neurons.
We evaluate Eq. (15) with a forward pass that calculates the chain from left to right: with L = α N +2 T β N +2 . We use a scaling algorithm similar to that for the Hidden Markov Models [3]. After calculating each α n , we calculate c n = ||α n || and divide each α n by its norm before calculating α n+1 . With the scaling algorithm, each α n has the norm equal to 1, and the likelihood is equal to the product of all coefficients c, that is, log L = c n .

Analytical derivation of the likelihood derivatives
We have previously derived the expressions for the variational derivatives of the likelihood with respect to the force F (x), the auxiliary function F 0 (x), and the noise magnitude D [1]: Here the matrix G is computed with a forward-backward algorithm [3]. First, we perform the forward pass using Eq. (16). Then, we perform the backward path via a series of matrix-vector multiplications: We compute the matrix G during the backward pass: where the matrix Γ τ ij is equal to the negative identity matrix for τ = N + 2, Γ N +2 = −I, and otherwise: where ∆t τ = t τ − t τ −1 .
In this work, we also perform the inference of tuning functions f k (x) via the auxiliary functions F k (x) and auxiliary variables C k . To derive the analytical expressions for these derivatives, we follow similar steps as in Ref. [1]. We write the expression for the likelihood in terms of propagation and emission operators: where we use the bra-ket notation for the state vectors and operator matrices. This form is basis independent and allows for the analytical likelihood calculation. In this notation, row vectors, such as ρ T 0 , correspond to bra states ρ 0 |, propagation matrices, such as T i ,correspond to the exponential operators exp (−H(t i − t i−1 )), spike emission matrices correspond to the spike emission operators, and column vectors, such as β N +2 , correspond to ket states |β N +2 .
First, we compute the derivative of the likelihood with respect to the tuning function f k (x) of neuron k. In Eq. (21), each of the propagation operators depends on f k (x). In addition, the emission operators for which k i = k also depend on f k (x). Using the formula for product of the derivatives, we obtain: where in the second sum, the index τ 2 takes the values of indices corresponding to spikes of the neuron k. The quantities a, b,â,b are obtained via two forward and backward passes: |β n = y kn e −H∆t n+1 |β n+1 , n = 1, 2, . . . N, α 0 | = ρ 0 |e −H∆t 1 , α n | = α n−1 |y kn e −H∆t n+1 , n = 1, 2 . . . N, In the discretized space, α τ | correspond to the forward vectors α τ in Eq. (16), and |β τ correspond to the backward vectors β τ in Eq. (18) with components α i (τ ) and β i (τ ), respectively. Using the formula for the derivative of the exponential operator, we obtain [4]: where the matrix Γ is defined in Eq. (20). Operator H depends on f k (x) through the term H I = M k=1 f k (x), and we compute its variational derivative using the Euler-Lagrange equation: 15 As a result, we obtain for the first term on the right hand side of Eq. (22): Note thatĜ ij is the same as G ij defined in Eq. (19) except for the last term, since the absorption operator does not depend on the tuning function and hence does not contribute to the derivative. We similarly evaluate the second term on the right hand side of Eq. (22) and obtain the likelihood derivative with respect to the tuning function: with the matrix G (k) ij is defined as: where τ 2 runs through all time points where neuron k spiked. Next, we compute the likelihood derivatives with respect to the auxiliary function F k (x) and the auxiliary variable C k used in the optimization. We note that To evaluate the derivative of f k (x) with respect to F k (x), we introduce another auxiliary function where H(s) is the Heaviside step function. With this substitution, f k (x) = C k exp(R k (x)), hence: Using this results, we obtain: Using the results in Eqs. (29), (33), we obtain the final expressions: 16 and for the derivative with respect to the constants C k we obtain: In summary, we compute the derivatives of the likelihood with respect to F , F 0 and D using Eq. (17) derived in Ref. [1]. The derivatives with respect to F k (x) and C k for each neuron k are computed using Eqs. (34), (35) and (29).

Maximum likelihood optimization with ADAM algorithm
We optimized the model likelihood using a modified ADAM gradient-descent algorithm combined with line searches. The ADAM update on iteration t is: where α, β 1 , β 2 , are hyperparameters, m t is a running average of the gradient g t on iteration t, v t is a running average of the gradient's L 2 squared norm,m t andv t are computed from m t and v t to correct the initialization bias (at zero iteration m 0 = v 0 = 0). We apply Eqs. (36) independently to each of the model's components F (x), F 0 (x), F i (x), D, C i . The main difference between the original ADAM algorithm and our version is that we optimize over the space of continuous functions F (x), F 0 (x), and {F i (x)} and not independent discrete parameters, therefore we scale their gradients by the average function's L 2 -norm defined as φ(x) 2 = x |φ(x)| 2 dx. On each epoch of ADAM, we randomly group the trials into 20 identical trial batches. Each epoch thus consisted of 20 iterations. We then compute ADAM quantities and update each of the model components using Eqs. (36) on each batch of trials. In addition, on a subset of epochs, we performed line searches for the scalar parameters D and each C i using L-BFGS-B method from scipy.optimize.minimize toolbox. Since a line search is computationally expensive, we perform only 30 line searches spaced logarithmically over the 5,000 epochs range, such that most line searches are concentrated at early epochs.
To perform shared optimization across four stimulus conditions, we randomly split trials for each of the four conditions into 20 batches of equal size, resulting in 80 batches total. We randomly permuted the order of these 80 batches, but all trials in one batch come from the same condition. For each trial batch, we calculated the gradients and updated the model components Eqs. 36. In this case, we optimized four potential functions Φ 1 (x), Φ 2 (x), Φ 3 (x), Φ 4 (x), each of which was updated only on trial batches from the corresponding condition. Other parameters, including D, F i (x), C i and F 0 (x) were the same for all conditions, and thus were updated 80 times per epoch on all trial batches.
For bootstrapping, we first randomly divided our dataset into two equally sized data samples, and then sampled the trials randomly with replacement from each of the data samples. Thus, our bootstrap samples from two data halves did not intersect with each other but both of them contained repeated trials.

Model selection
We used the model selection procedure based on feature consistency that we developed in Ref. [1]. First, we introduce feature complexity of a model M = −S[Φ(x), D, p 0 (x); Φ R (x), D R , p R 0 (x)] defined as a negative trajectory entropy. The trajectory entropy is defined as a negative Kullback-Leibler (KL) divergence between the distributions P [X (t)] and Q[X (t)] [5]: